Matrix approximation is a common tool in machine learning for building accurate 
prediction models for recommendation systems, text mining, and computer vision. 
A prevalent assumption in constructing matrix approximations is that the partially 
observed matrix is low-rank. We propose a new matrix approximation model 
where we assume instead that the matrix is only locally low-rank, leading to a 
representation of the observed matrix as a weighted sum of low-rank matrices. We 
analyze the accuracy of the proposed local low-rank modeling. Our experiments 
show improvements in prediction accuracy on recommendation tasks. 

1 Introduction 

Matrix approximation is a common task in machine learning. Given a few observed matrix entries 
$\{M_{a_1,b_1}, \ldots, M_{a_m,b_m}\}$, matrix approximation constructs a matrix $\hat{M}$ that approximates $M$ at its 
unobserved entries. In general, the problem of completing a matrix $M$ based on a few observed 
entries is ill-posed, as there are infinitely many matrices that perfectly agree with the observed 
entries of $M$. Thus, we need additional assumptions, such as that $M$ is a low-rank matrix. More 
formally, we approximate a matrix $M \in \mathbb{R}^{n_1 \times n_2}$ by a rank-$r$ matrix $\hat{M} = UV^\top$, where $U \in \mathbb{R}^{n_1 \times r}$, 
$V \in \mathbb{R}^{n_2 \times r}$, and $r \ll \min(n_1, n_2)$. In this note, we assume that $M$ behaves as a low-rank matrix in 
the vicinity of certain row-column combinations, instead of assuming that the entire $M$ is low-rank. 
We therefore construct several low-rank approximations of $M$, each being accurate in a particular 
region of the matrix. Smoothing the local low-rank approximations, we express $\hat{M}$ as a linear 
combination of low-rank matrices that approximate the unobserved matrix $M$. This mirrors the 
theory of non-parametric kernel smoothing, which is primarily developed for continuous spaces, 
and generalizes well-known compressed sensing results to our setting. 

2 Global and Local Low-Rank Matrix Approximation 

We describe in this section two standard approaches for low-rank matrix approximation (LRMA). 
The original (partially observed) matrix is denoted by $M \in \mathbb{R}^{n_1 \times n_2}$ and its low-rank approximation 
by $\hat{M} = UV^\top$, where $U \in \mathbb{R}^{n_1 \times r}$, $V \in \mathbb{R}^{n_2 \times r}$, and $r \ll \min(n_1, n_2)$. 

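Global LRMA. Incomplete SVD is a popular approach for constructing a low-rank approximation 
$\hat{M}$ by minimizing the Frobenius norm of the error over the set $A$ of observed entries of $M$: 

$$(U, V) = \arg\min_{U,V} \sum_{(a,b) \in A} \left([UV^\top]_{a,b} - M_{a,b}\right)^2. \qquad (1)$$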

Another popular approach is minimizing the nuclear norm of a matrix (defined as the sum of singular 
values of the matrix) satisfying constraints constructed from the training set: 
Figure 1: For illustrative purposes, we assume a distance function $d$ whose neighborhood structure 
coincides with the natural order on indices. That is, $s = (a, b)$ is similar to $u = (a', b')$ if $|a - a'|$ 
and $|b - b'|$ are small. (Left) For all $s \in [n_1] \times [n_2]$, the neighborhood $\{s' : d(s, s') < h\}$ in 
the original matrix $M$ is approximately described by the corresponding entries of the low-rank 
matrix $\mathcal{T}(s)$ (shaded regions of $M$ are matched by lines to the corresponding regions in $\mathcal{T}(s)$ that 
approximate them). If $d(s, u)$ is small, $\mathcal{T}(s)$ is similar to $\mathcal{T}(u)$, as shown by their spatial closeness 
in the embedding space $\mathbb{R}^{n_1 \times n_2}$. (Right) The original matrix $M$ (bottom) is described locally by the 
low-rank matrices $\mathcal{T}(t)$ (near $t$) and $\mathcal{T}(u)$ (near $u$). The lines connecting the three matrices identify 
identical entries: $M_t = \mathcal{T}_t(t)$ and $M_u = \mathcal{T}_u(u)$. The equation at the top right shows a relation 
tying the three patterned entries. Assuming the distance $d(t, u)$ is small, $\epsilon = \mathcal{T}_u(t) - \mathcal{T}_u(u) = 
\mathcal{T}_u(t) - M_u$ is small as well. 



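$$\hat{M} = \arg\min_{X} \|X\|_* \quad \text{s.t.} \quad \|\Pi_A(X - M)\|_F < \alpha, \qquad (2)$$

where $\Pi_A : \mathbb{R}^{n_1 \times n_2} \to \mathbb{R}^{n_1 \times n_2}$ is the projection defined by $[\Pi_A(M)]_{a,b} = M_{a,b}$ if $(a, b) \in A$ and 0 
otherwise, and $\|\cdot\|_F$ is the Frobenius norm. 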

Minimizing the nuclear norm $\|X\|_*$ is an effective surrogate for minimizing the rank of $X$. One 
advantage of (2) over (1) is that we do not need to constrain the rank of $\hat{M}$ in advance. However, 
problem (1) is substantially easier to solve than problem (2). 
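
To make problem (1) concrete, the following NumPy sketch fits $U$ and $V$ on the observed entries only, using alternating least squares. It is an illustration under stated assumptions, not the solver used in this note: the function name `incomplete_svd`, the small ridge term `reg` (added only to keep each linear solve well-conditioned), and the fixed iteration count are all choices made for the example.

```python
import numpy as np

def incomplete_svd(M, mask, rank, n_iters=50, reg=1e-3, seed=0):
    """Fit M ~ U @ V.T on the observed entries (mask == True) by alternating
    least squares. `reg` is an assumed ridge term, not part of problem (1)."""
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U = rng.normal(scale=0.1, size=(n1, rank))
    V = rng.normal(scale=0.1, size=(n2, rank))
    ridge = reg * np.eye(rank)
    for _ in range(n_iters):
        # Update each row of U from the columns observed in that row.
        for a in range(n1):
            obs = mask[a]
            if obs.any():
                Vo = V[obs]
                U[a] = np.linalg.solve(Vo.T @ Vo + ridge, Vo.T @ M[a, obs])
        # Symmetric update for each row of V.
        for b in range(n2):
            obs = mask[:, b]
            if obs.any():
                Uo = U[obs]
                V[b] = np.linalg.solve(Uo.T @ Uo + ridge, Uo.T @ M[obs, b])
    return U, V
```

The approximation is then `U @ V.T`, which is also defined at the unobserved entries. Rows or columns without any observation keep their random initialization in this sketch.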

Local LRMA. To facilitate a local low-rank matrix approximation, we assume that there exists 
a metric structure over $[n_1] \times [n_2]$, where $[n]$ denotes the set of integers 
$\{1, \ldots, n\}$. Formally, $d((a, b), (a', b'))$ reflects the similarity between the rows $a$ and $a'$ and the columns 
$b$ and $b'$. In the global matrix factorization setting above, we assume that the matrix $M \in \mathbb{R}^{n_1 \times n_2}$ 
has a low-rank structure. In the local setting, however, we assume that the model is characterized by 
multiple low-rank $n_1 \times n_2$ matrices. Specifically, we assume a mapping $\mathcal{T} : [n_1] \times [n_2] \to \mathbb{R}^{n_1 \times n_2}$ 
that associates with each row-column combination in $[n_1] \times [n_2]$ a low-rank matrix that describes 
the entries of $M$ in its neighborhood (in particular, this applies to the observed entries $A$): 
$\mathcal{T} : [n_1] \times [n_2] \to \mathbb{R}^{n_1 \times n_2}$, where $\mathcal{T}_{a,b}(a, b) = M_{a,b}$. Note that in contrast to the global estimate in 
Global LRMA, our model now consists of multiple low-rank matrices, each describing the original 
matrix $M$ in a particular neighborhood. Figure 1 illustrates this model. 

Without additional assumptions, it is impossible to estimate the mapping $\mathcal{T}$ from a set of $m < n_1 n_2$ 
observations. Our additional assumption is that the mapping $\mathcal{T}$ is slowly varying. Since the domain 
of $\mathcal{T}$ is discrete, we assume that $\mathcal{T}$ is Hölder continuous. Following common approaches in 
non-parametric statistics, we define a smoothing kernel $K_h(s_1, s_2)$, where $s_1, s_2 \in [n_1] \times [n_2]$, as a 
non-negative symmetric unimodal function that is parameterized by a bandwidth parameter $h > 0$. 
A large value of $h$ implies that $K_h(s, \cdot)$ has a wide spread, while a small $h$ corresponds to a narrow 
spread of $K_h(s, \cdot)$. We use, for example, the Epanechnikov kernel, defined as 
$K_h(s_1, s_2) = \frac{3}{4}\left(1 - d(s_1, s_2)^2\right) \mathbf{1}[d(s_1, s_2) < h]$. We denote by $K_h^{(a,b)}$ the matrix whose $(i, j)$-entry is $K_h((a, b), (i, j))$. 
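
As a small illustration, the Epanechnikov kernel can be evaluated on precomputed distances as follows. The sketch follows the formula exactly as written above, so it takes the distance $d(s_1, s_2)$ as input rather than the index pairs; the function name `epanechnikov` is an assumption of the example.

```python
import numpy as np

def epanechnikov(d, h):
    """Kernel value (3/4) * (1 - d^2) for distances d < h, and 0 otherwise.
    Accepts a scalar or an array of distances."""
    d = np.asarray(d, dtype=float)
    return 0.75 * (1.0 - d**2) * (d < h)
```

With $h = 0.8$, for instance, a distance of 0.5 receives weight 0.5625 and any distance of 0.8 or more receives weight 0. Note that the formula as written stays non-negative only when $h \le 1$.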

Incomplete SVD (1) and compressed sensing (2) can be extended to local versions as follows: 

Incomplete SVD: $\hat{\mathcal{T}}(a, b) = \arg\min_{X} \|K_h^{(a,b)} \odot \Pi_A(X - M)\|_F$ s.t. $\operatorname{rank}(X) = r$ (3) 

Compressed Sensing: $\hat{\mathcal{T}}(a, b) = \arg\min_{X} \|X\|_*$ s.t. $\|K_h^{(a,b)} \odot \Pi_A(X - M)\|_F < \alpha$, (4) 
Figure 2: RMSE of global-LRMA, local-LRMA, and other baselines on the MovieLens 10M (Left) and 
Netflix (Right) datasets. Local-LRMA models are indicated by thick solid lines, while global-LRMA 
models are indicated by dotted lines. Models with the same rank are colored identically. 



where $\odot$ denotes the component-wise product of two matrices, $[A \odot B]_{i,j} = A_{i,j} B_{i,j}$. 
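
The local incomplete SVD problem (3) differs from the global one only in that each residual is scaled by a kernel weight. A minimal sketch, assuming a factorization $X = UV^\top$ and a weighted alternating least squares solver (neither the solver nor the ridge term `reg` is taken from the text; the name `local_incomplete_svd` is hypothetical):

```python
import numpy as np

def local_incomplete_svd(M, mask, K, rank, n_iters=50, reg=1e-3, seed=0):
    """Weighted ALS for min_X ||K * mask * (X - M)||_F with X = U @ V.T.
    K holds the kernel weights of one anchor point. Unobserved entries of M
    may hold any finite placeholder because their weight is zero."""
    rng = np.random.default_rng(seed)
    n1, n2 = M.shape
    U = rng.normal(scale=0.1, size=(n1, rank))
    V = rng.normal(scale=0.1, size=(n2, rank))
    # Squaring the Frobenius norm puts weight (K * mask)^2 on each squared error.
    W = (K * mask) ** 2
    ridge = reg * np.eye(rank)  # assumed ridge term; keeps every solve non-singular
    for _ in range(n_iters):
        for a in range(n1):
            w = W[a]
            U[a] = np.linalg.solve((V.T * w) @ V + ridge, (V.T * w) @ M[a])
        for b in range(n2):
            w = W[:, b]
            V[b] = np.linalg.solve((U.T * w) @ U + ridge, (U.T * w) @ M[:, b])
    return U @ V.T
```

Entries with zero kernel weight do not influence the fit, so when the kernel has limited support each local problem effectively involves only a neighborhood of the anchor point.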

The two optimization problems above describe how to estimate $\hat{\mathcal{T}}(a, b)$ for a particular choice of 
$(a, b) \in [n_1] \times [n_2]$. Conceptually, this technique can be applied for each test entry $(a, b)$, resulting 
in the matrix approximation $\hat{M}_{a,b} = \hat{\mathcal{T}}_{a,b}(a, b)$, where $(a, b) \in [n_1] \times [n_2]$. However, this 
requires solving a non-linear optimization problem for each test index $(a, b)$ and is thus computationally 
prohibitive. Instead, we use Nadaraya-Watson local regression with a set of $q$ local estimates 
$\hat{\mathcal{T}}(s_1), \ldots, \hat{\mathcal{T}}(s_q)$ in order to obtain a computationally efficient estimate $\hat{\hat{\mathcal{T}}}(s)$ for all $s \in [n_1] \times [n_2]$: 

$$\hat{\hat{\mathcal{T}}}(s) = \sum_{i=1}^{q} \frac{K_h(s_i, s)}{\sum_{j=1}^{q} K_h(s_j, s)} \, \hat{\mathcal{T}}(s_i). \qquad (5)$$

Equation (5) is simply a weighted average of $\hat{\mathcal{T}}(s_1), \ldots, \hat{\mathcal{T}}(s_q)$, where the weights ensure that values 
of $\hat{\mathcal{T}}$ at indices close to $s$ contribute more than those at indices further away from $s$. 
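
The weighted average described above can be sketched in a few lines. Here `local_estimates` stacks the $q$ anchor-point matrices and `weights[i]` stands for the kernel value between anchor $s_i$ and the query index $s$; both names, and the choice to raise an error when no anchor has positive weight, are assumptions of this example.

```python
import numpy as np

def nadaraya_watson(local_estimates, weights):
    """Kernel-weighted average of q local estimates.
    local_estimates: array of shape (q, n1, n2); weights: array of shape (q,)."""
    T = np.asarray(local_estimates, dtype=float)
    w = np.asarray(weights, dtype=float)
    total = w.sum()
    if total == 0:
        raise ValueError("no anchor point has positive kernel weight at s")
    # Normalize the weights to sum to one, then average over the anchor axis.
    return np.tensordot(w / total, T, axes=1)
```

For a prediction at $s = (a, b)$, only the $(a, b)$ entry of the returned matrix is needed, so in practice the average can be taken over that single entry of each local estimate.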

Note that the local version can be faster than global SVD since (a) the low-rank approximations are 
independent of each other and can therefore be computed in parallel, and (b) the rank used in the local SVD 
model can be significantly lower than the rank used in a global one. If the kernel has limited 
support ($K_h(s, s')$ is often zero), the regularized SVD problems are sparser than the global 
SVD problem, resulting in additional speedup. 



3 Experiments 

We compare local-LRMA to global-LRMA and other state-of-the-art techniques on popular 
recommendation systems datasets: MovieLens 10M and Netflix. We split the data into training 
and test sets in a 9:1 ratio. A default prediction value of 3.0 was used whenever we encountered a test user or item 
without training observations. We use the Epanechnikov kernel with $h_1 = h_2 = 0.8$, assuming a 
product form $K_h((a, b), (c, d)) = K_{h_1}(a, c) \, K'_{h_2}(b, d)$. For the distance function $d$, we use the arccos 
distance, defined as $d(x, y) = \arccos\left(\frac{\langle x, y \rangle}{\|x\| \, \|y\|}\right)$. Anchor points were chosen randomly among 
observed training entries. L2 regularization is used for local low-rank approximation. 
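
The distance and the product-form kernel used here can be sketched as follows. The vectors `x` and `y` stand for whatever feature vectors represent two rows (or two columns); how those vectors are obtained is not specified above and is left open in this example. The function names are hypothetical, and the bandwidth defaults mirror the value 0.8 quoted above.

```python
import numpy as np

def arccos_distance(x, y):
    """Angle between x and y: arccos(<x, y> / (||x|| ||y||))."""
    cos = np.dot(x, y) / (np.linalg.norm(x) * np.linalg.norm(y))
    # Clip to guard against rounding slightly outside [-1, 1].
    return np.arccos(np.clip(cos, -1.0, 1.0))

def product_kernel(d_rows, d_cols, h1=0.8, h2=0.8):
    """Product of two Epanechnikov kernels, one on the row distance d(a, c)
    and one on the column distance d(b, d)."""
    def k(d, h):
        d = np.asarray(d, dtype=float)
        return 0.75 * (1.0 - d**2) * (d < h)
    return k(d_rows, h1) * k(d_cols, h2)
```

An entry therefore receives positive weight for an anchor point only if both its row and its column lie within the respective bandwidths of the anchor's row and column.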

Figure 2 graphs the RMSE of local-LRMA and global-LRMA, as well as the recently proposed 
method called DFC (Divide-and-Conquer Matrix Factorization), as a function of the number of 
anchor points. Both local-LRMA and global-LRMA improve as $r$ increases, but local-LRMA with 
rank $r > 5$ outperforms global-LRMA with any rank. Moreover, local-LRMA outperforms 
global-LRMA on average even with only a few anchor points (though the performance of local-LRMA improves 
further as the number of anchor points $q$ increases). 